
Large result set retrieval: streaming-only, all split types, query rewrite fix #128

Open

schenksj wants to merge 3 commits into main from feature/large-result-set-retrieval

Conversation

schenksj (Collaborator) commented Mar 13, 2026

Summary

  • Streaming retrieval for all split types: startStreamingRetrieval() now works for both companion (parquet) and regular (tantivy doc store) splits
  • Fused path eliminated: searchAndRetrieveArrowFfi() deprecated and reimplemented as streaming wrapper — single code path reduces bug surface
  • Non-companion streaming: New streaming_doc_retrieval.rs converts tantivy doc store documents to Arrow RecordBatches via doc_async() with bounded memory
  • Query rewrite bug fix: perform_bulk_search() now applies the same companion query rewrites as the regular search() path — fixes IS NOT NULL returning 0 rows and exact_only EqualTo returning nothing
  • Memory safety: ARC_REGISTRY-based session handles, Arrow FFI buffer leak prevention, bounds-checked array access
  • Adaptive I/O: Per-file read strategy based on selectivity to minimize S3 GET requests (companion path)
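The bounded-memory property described above can be pictured with a standard bounded channel: the producer blocks once a fixed number of batches is in flight, so peak memory stays capped regardless of total result size. The sketch below is illustrative only and does not reflect the PR's actual pipeline; the function name `produce_batches` and the capacity of 3 are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical sketch: a producer emits fixed-size "batches" through a
// bounded channel, so at most `capacity` batches are buffered at once.
// Returns the sizes of the batches the consumer received, in order.
fn produce_batches(total_rows: usize, batch_rows: usize) -> Vec<usize> {
    // Capacity of 3 caps in-flight batches; send() blocks when full.
    let (tx, rx) = sync_channel::<Vec<u64>>(3);

    let producer = thread::spawn(move || {
        let mut sent = 0usize;
        while sent < total_rows {
            let n = batch_rows.min(total_rows - sent);
            let batch: Vec<u64> = (sent as u64..(sent + n) as u64).collect();
            // If the consumer dropped its receiver, stop producing.
            if tx.send(batch).is_err() {
                return;
            }
            sent += n;
        }
        // tx dropped here, which terminates the consumer's iterator.
    });

    // Consumer drains batches as they arrive; memory stays bounded.
    let sizes: Vec<usize> = rx.iter().map(|b| b.len()).collect();
    producer.join().unwrap();
    sizes
}

fn main() {
    println!("{:?}", produce_batches(10, 4)); // batches of 4, 4, 2 rows
}
```

The real pipeline carries Arrow RecordBatches instead of integer vectors, but the back-pressure mechanism is the same idea.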

Changes since initial PR

  • Deleted fused_retrieval.rs — single streaming path for all result sizes
  • Added streaming_doc_retrieval.rs — non-companion split streaming via tantivy doc store
  • Fixed bulk_retrieval.rs — added rewrite_companion_query() for FieldPresence → _phash_* and exact_only → hash rewrites, plus ensure_fast_fields_for_query() for parquet-transcoded fast fields
  • Added StreamingRetrievalSession::new() constructor in streaming_ffi.rs
  • JNI nativeStartStreamingRetrieval now routes companion vs regular splits
  • Updated Java javadoc and error messages to reflect all-split-type support
  • Added 5 regression tests for streaming + field presence, IS NULL, exact_only queries
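The query-rewrite fix listed above can be sketched as a recursive pass over a query tree: field-presence checks are redirected to a per-field presence column, and exact-match literals are replaced with their hashes. This is a toy model, not the real `rewrite_companion_query()` (which operates on tantivy query ASTs); the `Query` enum, the `_phash_` prefix handling, and `hash64` below are illustrative stand-ins.

```rust
// Hypothetical toy AST illustrating the two rewrites described above:
// FieldPresence -> _phash_* column probe, and exact term -> hash lookup.
#[derive(Debug, PartialEq)]
enum Query {
    FieldPresence { field: String },
    ExactTerm { field: String, value: String },
    Bool(Vec<Query>),
}

// Stand-in for the real term hashing (FNV-1a 64-bit here); the actual
// hash used by the companion format may differ.
fn hash64(s: &str) -> u64 {
    s.bytes().fold(14695981039346656037u64, |h, b| {
        (h ^ b as u64).wrapping_mul(1099511628211)
    })
}

fn rewrite(q: Query) -> Query {
    match q {
        // IS NOT NULL: probe the per-field presence-hash column instead of
        // the (absent) original column, so presence queries return rows.
        Query::FieldPresence { field } => Query::ExactTerm {
            field: format!("_phash_{field}"),
            value: "1".to_string(),
        },
        // exact_only fields store hashes, so rewrite the literal to its hash.
        Query::ExactTerm { field, value } => Query::ExactTerm {
            field,
            value: hash64(&value).to_string(),
        },
        // Recurse into compound queries.
        Query::Bool(children) => {
            Query::Bool(children.into_iter().map(rewrite).collect())
        }
    }
}

fn main() {
    let q = rewrite(Query::FieldPresence { field: "name".to_string() });
    println!("{:?}", q);
}
```

Without such a pass, a presence query probes a column that does not exist in the companion layout and matches nothing, which is exactly the "IS NOT NULL returns 0 rows" symptom this PR fixes.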

Files changed

  • native/src/split_searcher/fused_retrieval.rs — DELETED
  • native/src/split_searcher/streaming_doc_retrieval.rs — NEW (non-companion streaming)
  • native/src/split_searcher/bulk_retrieval.rs — query rewrite fix
  • native/src/split_searcher/document_retrieval/doc_retrieval_jni.rs — dual-path routing
  • native/src/parquet_companion/streaming_ffi.rs — new constructor
  • native/src/split_searcher/mod.rs — module registration
  • SplitSearcher.java — updated javadoc, deprecated wrapper
  • LargeResultSetRetrievalTest.java — removed fused tests, added regression tests
  • docs/LARGE_RESULT_SET_DEVELOPER_GUIDE.md — updated architecture diagram

Test plan

  • cargo check passes
  • cargo test --lib streaming_doc_retrieval — 4 unit tests pass
  • mvn test — 1261 tests, 0 failures, 0 errors
  • Streaming IS NOT NULL returns correct row count (regression test)
  • Streaming IS NULL returns 0 when all rows have field (regression test)
  • Streaming exact_only term query finds matching document (regression test)
  • Non-companion split streaming works end-to-end (regression test)

🤖 Generated with Claude Code

…e APIs

Implements three-tier companion-mode bulk retrieval for returning thousands
to millions of rows from parquet-backed splits via Arrow FFI with bounded
memory. Includes DocIdCollector (no-score search), fused single-call path,
streaming session pipeline with adaptive I/O strategy, ARC_REGISTRY-based
session handles for memory safety, and StreamingSession AutoCloseable Java
wrapper with double-close protection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
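The ARC_REGISTRY-based session handles mentioned in this commit can be sketched as a global table mapping opaque `u64` handles to `Arc`'d sessions: the JNI layer hands Java a handle instead of a raw pointer, so a stale or double-closed handle fails lookup rather than dereferencing freed memory. All names below (`Session`, `register`, `lookup`, `close`) are hypothetical; the real registry's shape may differ.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical sketch of an ARC_REGISTRY-style handle table.
struct Session {
    name: String,
}

static NEXT_HANDLE: AtomicU64 = AtomicU64::new(1);

fn registry() -> &'static Mutex<HashMap<u64, Arc<Session>>> {
    static REG: OnceLock<Mutex<HashMap<u64, Arc<Session>>>> = OnceLock::new();
    REG.get_or_init(|| Mutex::new(HashMap::new()))
}

// Store the session and hand back an opaque handle (never a raw pointer).
fn register(session: Session) -> u64 {
    let handle = NEXT_HANDLE.fetch_add(1, Ordering::Relaxed);
    registry().lock().unwrap().insert(handle, Arc::new(session));
    handle
}

// A stale handle yields None instead of a use-after-free.
fn lookup(handle: u64) -> Option<Arc<Session>> {
    registry().lock().unwrap().get(&handle).cloned()
}

// Idempotent close: only the call that actually removed the entry returns
// true, mirroring the Java wrapper's double-close protection.
fn close(handle: u64) -> bool {
    registry().lock().unwrap().remove(&handle).is_some()
}

fn main() {
    let h = register(Session { name: "demo".to_string() });
    println!("open: {}", lookup(h).is_some());
    println!("closed: {}", close(h));
}
```

Because the registry holds an `Arc`, an in-flight native call that already cloned the session keeps it alive even if Java closes the handle concurrently; the memory is freed only when the last clone drops.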
claude bot commented Mar 13, 2026

Claude finished @schenksj's task


Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 4 Issues (1 Critical, 2 Moderate, 1 Minor)

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas

🔴 CRITICAL: Session Handle Memory Leak — FALSE ALARM

File: SplitSearcher.java — StreamingSession class

RESOLVED — The StreamingSession class correctly implements AutoCloseable, with proper native cleanup via nativeCloseStreamingSession() in its close() method.


🟡 MODERATE: Integer Overflow Risk

File: bulk_retrieval.rs:50-51

let total: usize = segment_fruits.iter().map(|v| v.len()).sum();
let mut result = Vec::with_capacity(total);

Issue: On very large result sets, total can overflow since each segment could contain millions of documents.

Risk: Panic on allocation or incorrect capacity calculation.

Recommendation:

let total = segment_fruits.iter()
    .map(|v| v.len())
    .try_fold(0usize, |acc, len| acc.checked_add(len))
    .ok_or_else(|| anyhow!("Result set too large: document count overflow"))?;

🟡 MODERATE: Silent Error Swallowing in Stream

File: streaming_ffi.rs:215-217

if tx.send(Ok(normalized)).await.is_err() {
    perf_println!("⏱️ STREAMING: consumer dropped — stopping producer");
    return Ok(()); // ← Silent success despite potential error
}

Issue: Producer treats channel closure as success, potentially masking critical errors.

Risk: Silent data loss if consumer fails early.

Recommendation: Distinguish early failure from normal completion:

if tx.send(Ok(normalized)).await.is_err() {
    if rows_emitted == 0 {
        return Err(anyhow!("Consumer dropped before any data sent - possible error"));
    }
    perf_println!("⏱️ STREAMING: consumer dropped after {} rows", rows_emitted);
    return Ok(());
}

🟢 MINOR: Missing Bounds Check

File: streaming_ffi.rs:195

let batches = read_parquet_batches_for_file(/* ... */);

Issue: No validation that file_idx is within manifest.parquet_files bounds.

Risk: Potential panic on malformed input.

Recommendation: Add bounds check:

if file_idx >= manifest.parquet_files.len() {
    return Err(anyhow!("Invalid file_idx {} >= {}", file_idx, manifest.parquet_files.len()));
}
let file_entry = &manifest.parquet_files[file_idx];

No Issues Found In:

  • Memory safety (proper Arc management)
  • JNI error propagation (well handled)
  • API consistency (matches existing patterns)
  • Security vulnerabilities (no unsafe operations)
  • Merge conflicts (clean)

Overall Assessment: Solid implementation with good architecture. The identified issues are edge cases that should be addressed for production robustness.

…check

1. merge_fruits: use checked_add to prevent usize overflow on very large
   result sets instead of unchecked sum
2. Streaming producer: distinguish early consumer drop (rows_emitted==0,
   returns error) from normal mid-stream drop (returns Ok)
3. Streaming producer: validate file_idx bounds before indexing into
   manifest.parquet_files to prevent panic on malformed input

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Mar 13, 2026

Claude finished @schenksj's task




Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 1 Minor Issue

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas
  • Integer overflow protection already implemented in docid_collector.rs with checked_add()
  • Proper bounds checking for file indices in streaming_ffi.rs:171-176

🟢 MINOR: Missing Bounds Check in Arrow-to-Tantivy Module

File: streaming_ffi.rs:197

let batches = read_parquet_batches_for_file(
    file_idx,
    &rows,
    projected_fields.as_deref().map(|s| s as &[String]),
    manifest,
    storage,
    metadata_cache,
    byte_cache,
    effective_coalesce,
).await?;

Issue: While bounds checking exists at line 171-176, the read_parquet_batches_for_file function receives file_idx that could potentially access arrays without additional validation in the callee.

Risk: Potential panic if the downstream function doesn't validate the file index.

Recommendation: Verify that read_parquet_batches_for_file includes its own bounds validation, or add a comment confirming that validation occurs at line 171-176.


No Issues Found In:

  • Memory safety (proper Arc management with ARC_REGISTRY)
  • JNI error propagation (comprehensive error handling with to_java_exception)
  • API consistency (matches existing patterns with AutoCloseable, synchronized methods)
  • Security vulnerabilities (no unsafe operations, proper credential handling)
  • Merge conflicts (clean)
  • Session lifecycle management (proper StreamingSession.close() implementation)
  • Integer overflow (already protected with checked_add())
  • Resource cleanup (proper native session cleanup via nativeCloseStreamingSession)

Overall Assessment: Excellent implementation with solid architecture and comprehensive safety measures. The single minor issue is a best practice recommendation rather than a functional bug.

…rite bug

Three changes in this commit:

1. Remove fused (non-streaming) companion retrieval path entirely.
   The deprecated searchAndRetrieveArrowFfi() now delegates to the
   streaming session internally. Deletes fused_retrieval.rs and its
   JNI function.

2. Add Arrow FFI streaming for non-companion (regular tantivy) splits.
   New streaming_doc_retrieval.rs converts tantivy doc store documents
   to Arrow RecordBatches via doc_async(), reusing the same
   StreamingRetrievalSession infrastructure. The JNI function now
   routes companion splits through parquet streaming and regular splits
   through tantivy doc store streaming.

3. Fix streaming query rewrite bug (TANTIVY4JAVA_STREAMING_FAST_FIELD_BUG).
   perform_bulk_search() was missing the companion query rewrites that
   the regular search() path applies: FieldPresence → _phash_* rewrite
   and exact_only term → hash lookup rewrite. Without these, IS NOT NULL
   returned 0 rows and exact_only EqualTo returned nothing. Also adds
   ensure_fast_fields_for_query() call for parquet-transcoded fast fields.

4. Add regression tests for streaming + field presence, IS NULL, and
   exact_only queries comparing streaming results against search() baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
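The regression tests described in point 4 compare streaming output against the `search()` baseline. A minimal order-insensitive comparison can be sketched as below; the function name `results_match` and the `(doc_id, value)` row shape are hypothetical, chosen only to illustrate the check.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the regression-test comparison: streaming results
// must match the search() baseline exactly, regardless of the order in
// which batches were emitted.
fn results_match(baseline: &[(u64, String)], streamed: &[(u64, String)]) -> bool {
    let to_map = |rows: &[(u64, String)]| -> BTreeMap<u64, String> {
        rows.iter().cloned().collect()
    };
    // Length check catches duplicated or dropped rows; map equality catches
    // wrong or reordered values keyed by document id.
    baseline.len() == streamed.len() && to_map(baseline) == to_map(streamed)
}

fn main() {
    let baseline = vec![(1, "a".to_string()), (2, "b".to_string())];
    let streamed = vec![(2, "b".to_string()), (1, "a".to_string())];
    println!("match: {}", results_match(&baseline, &streamed));
}
```

The value of baselining against `search()` is that it would have caught the rewrite bug directly: `search()` applied the companion rewrites and returned rows, while the streaming path did not and returned zero.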
@schenksj changed the title from "Add large result set retrieval (fused + streaming)" to "Large result set retrieval: streaming-only, all split types, query rewrite fix" on Mar 13, 2026

claude bot commented Mar 13, 2026

Claude finished @schenksj's task




Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review repository guidelines (CLAUDE.md)
  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 1 Minor Issue

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas
  • Integer overflow protection already implemented in docid_collector.rs with checked_add()
  • Proper bounds checking for file indices in streaming_ffi.rs:171-176

🟢 MINOR: Missing Bounds Check in Arrow-to-Tantivy Module

File: streaming_ffi.rs:197

let batches = read_parquet_batches_for_file(
    file_idx,
    &rows,
    projected_fields.as_deref().map(|s| s as &[String]),
    manifest,
    storage,
    metadata_cache,
    byte_cache,
    effective_coalesce,
).await?;

Issue: While bounds checking exists at line 171-176, the read_parquet_batches_for_file function receives file_idx that could potentially access arrays without additional validation in the callee.

Risk: Potential panic if the downstream function doesn't validate the file index.

Recommendation: Verify that read_parquet_batches_for_file includes its own bounds validation, or add a comment confirming that validation occurs at line 171-176.


No Issues Found In:

  • Memory safety (proper Arc management with ARC_REGISTRY)
  • JNI error propagation (comprehensive error handling with to_java_exception)
  • API consistency (matches existing patterns with AutoCloseable, synchronized methods)
  • Security vulnerabilities (no unsafe operations, proper credential handling)
  • Merge conflicts (clean)
  • Session lifecycle management (proper StreamingSession.close() implementation)
  • Integer overflow (already protected with checked_add())
  • Resource cleanup (proper native session cleanup via nativeCloseStreamingSession)

Overall Assessment: Excellent implementation with solid architecture and comprehensive safety measures. The single minor issue is a best practice recommendation rather than a functional bug.
